1 Introduction

The objective of this study is to build a classification model that predicts red wine quality from physicochemical characteristics. The dataset includes 1599 red wine samples, each with 11 physicochemical attributes (such as alcohol and pH) and 1 sensory attribute, quality. More details on the dataset can be found in the original paper (Cortez et al., 2009).

2 Exploratory Data Analysis (EDA)

In this section, illustrative and analytic visualizations are created for all attributes to understand each variable's distribution as well as the associations between variables. The EDA findings will be used to select important attributes for the red wine quality prediction model.

To begin with, let’s load the red wine CSV file and preview the dataset.
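The load-and-preview pattern can be sketched as follows; since the actual CSV path is not shown in the report, the snippet writes a tiny stand-in file to a temp path so it runs on its own (the previews below come from the real dataset):

```r
# Write a two-row stand-in CSV so this sketch is self-contained;
# in the actual analysis, read.csv() would point at the red wine file.
path <- tempfile(fileext = ".csv")
writeLines(c("X,alcohol,quality",
             "1,9.4,5",
             "2,9.8,5"), path)

wine <- read.csv(path)
head(wine)   # preview the top rows
tail(wine)   # preview the bottom rows to confirm consistent formatting
```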

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

It’s also good to check the bottom of the dataset, both to confirm formatting consistency and to catch situations such as comment lines appended at the end of a file.

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1594 1594           6.8            0.620        0.08            1.9
## 1595 1595           6.2            0.600        0.08            2.0
## 1596 1596           5.9            0.550        0.10            2.2
## 1597 1597           6.3            0.510        0.13            2.3
## 1598 1598           5.9            0.645        0.12            2.0
## 1599 1599           6.0            0.310        0.47            3.6
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1594     0.068                  28                   38 0.99651 3.42
## 1595     0.090                  32                   44 0.99490 3.45
## 1596     0.062                  39                   51 0.99512 3.52
## 1597     0.076                  29                   40 0.99574 3.42
## 1598     0.075                  32                   44 0.99547 3.57
## 1599     0.067                  18                   42 0.99549 3.39
##      sulphates alcohol quality
## 1594      0.82     9.5       6
## 1595      0.58    10.5       5
## 1596      0.76    11.2       6
## 1597      0.75    11.0       6
## 1598      0.71    10.2       5
## 1599      0.66    11.0       6

The data is formatted consistently and ready for analysis. Now let’s check the overall structure.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Observations:

  • The dataset contains 1599 observations of 13 variables.
  • X is simply a row index and carries no physicochemical information.
  • The 11 physicochemical attributes are all numeric; quality is an integer.

2.1 Univariate Plot & Analysis

The focus of this section is to understand the distribution characteristics of each variable, which helps reveal central tendency, extreme outliers, and any need for data transformation.

2.1.1 Dependent Variable

We start by creating a histogram of the dependent variable quality.

Quality in our samples ranges from 3 to 8, and the majority of samples have a quality score of 5 or 6. Note that the absence of “Excellent: 10” or “Very bad: 0” scores probably indicates there are no outliers in terms of red wine quality, although it may also reflect a tendency of raters to avoid extreme judgements.

Let’s also see the statistical summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Observations:

  • Quality scores 5 and 6 also mark the 1st and 3rd quartiles of the data respectively:
    • at least 25% of samples have a quality score of 5 or lower
    • at least 75% of samples have a quality score of 6 or lower.
  • Taking the median of at least three evaluations can be considered a measure to counter personal bias in flavor preferences. Therefore, we can assume the quality score is unbiased and directly related to wine quality.
  • However, while it might be safe to say that wine samples with quality score 8 are much better than those with quality score 3, it is hard to tell whether the difference between the score 4 group and the score 3 group reflects quality variation or human interpretation of the number categories. For example, one expert might use 8 to represent a high-quality sensory experience, while another might feel 7 already represents a high-quality wine. In other words, although we know higher is better, some of the variance may come from differences in how people perceive those number categories.
  • Therefore, the next step is to create a new categorical variable “level” that represents wine quality in 3 levels to reduce this perception noise:

  • “Low”: quality score 3 or 4
  • “Medium”: quality score 5 or 6
  • “High”: quality score 7 or 8

The categorical variable level will be our new dependent variable. Instead of predicting the exact quality score, our goal is to predict the quality level from the physicochemical attributes.
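The recoding described above can be sketched with base R's cut(); the variable names here are illustrative:

```r
# Map quality scores to three levels: (2,4] -> Low, (4,6] -> Medium, (6,8] -> High
quality <- c(3, 4, 5, 6, 7, 8)
level <- cut(quality,
             breaks = c(2, 4, 6, 8),
             labels = c("Low", "Medium", "High"))
table(level)
```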

The next step is to check the independent variable group.

2.1.2 Independent Variables

The independent variable group includes 11 attributes, all of which are continuous numerical variables. To better describe the central tendency graphically and detect any outliers, it is helpful to overlay the statistics on the distribution plots. Since there are 11 of them, we will write functions that first compute each variable's outlier bounds and then generate a list of plots with their corresponding statistic lines.
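The exact helper is not shown in the report, so as a sketch we assume the common Tukey rule (quartiles ± 1.5 × IQR) for the outlier bounds:

```r
# Compute lower bound, median and upper bound for one numeric variable.
outlier_bounds <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower  = q[1] - 1.5 * iqr,
    median = median(x),
    upper  = q[2] + 1.5 * iqr)
}
outlier_bounds(c(1, 2, 3, 4, 5, 100))   # 100 falls above the upper bound
```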


Three red dashed lines are superimposed on each histogram-density plot to mark the lower outlier bound, the median and the upper outlier bound respectively.

Fixed acidity falls roughly between 4 and 16 and is close to a Gaussian distribution. A few samples sit above the upper outlier line, but no special handling is required for this attribute.


Volatile acidity seems to have a bimodal or trimodal distribution, as both the kernel density curve and the histogram suggest. This attribute looks like a promising predictor, since our new dependent variable level also has three categories. A few upper-bound outliers are observed, but no special outlier handling is required.


Citric acid shows three peaks that can be clearly observed in both the histogram and the density curve. This attribute also seems to be a promising predictor.

For this attribute, no outliers are observed in the sample data.


Residual sugar has a positively skewed distribution with noticeable outliers. There are a couple of ways to handle them:

  • remove outliers from the original dataset: not adopted, since we are not sure whether the outliers are part of the data's characteristics or due to an incorrect collection process.
  • data transformation:
    • scale transformation, for a better view of the central tendency
    • actual transformation of the variable: this step might be needed if we decide to use this variable in the prediction model.

Square root or log transformations can pull in large numbers, so let’s compare them both:

The distribution produced by the square root transformation still has a fairly obvious tail on the right. So for further analysis, a log transformation can be applied if residual sugar is selected as a predictor.
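The comparison can also be made numerically. As a sketch (using synthetic log-normal data as a stand-in for residual sugar), a simple skewness statistic shows the log transform flattening the right tail more fully than the square root:

```r
# Simple sample skewness (third standardized moment).
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
sugar <- rlnorm(1000, meanlog = 0.9, sdlog = 0.4)  # right-skewed stand-in

c(raw  = skew(sugar),
  sqrt = skew(sqrt(sugar)),
  log  = skew(log10(sugar)))   # log brings skewness closest to 0
```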

At this point, we can use log scale to better view the original data range:

From the above log-scaled plot, we can see that the upper-bound outlier for residual sugar is about 4.


The chlorides distribution is similar to residual sugar, also highly skewed to the right, so we can apply log scaling to this variable too.

Now it is easier to see that chlorides for the majority of the data fall within 0.03 to 0.1, with a median value of 0.08.


Free sulfur dioxide is slightly positively skewed, but no outlier handling is needed since the distribution plot already captures the majority of the sample data.


The distribution shape for total sulfur dioxide is very similar to free sulfur dioxide: both are slightly positively skewed, with no need for outlier handling.


Density looks very much like a normal distribution, with data spread evenly on both sides of the median. It is worth noting that the density differences are very small: less than 0.1 across all samples. That most samples have density below 1 makes sense, since ethanol is less dense than water. One possible cause for wine denser than water could be a high sugar level, since the density of sucrose is about 1.59 g/cm3. We can further compare density and residual sugar in the bivariate section.


pH also has a distribution very close to Gaussian/normal.


Sulphates have a light tail on the positive side.


Although the kernel density curve smooths them out, two peaks are still visible in the alcohol histogram, which makes alcohol another promising predictor for wine quality level.

Observations from the univariate exploration:

  • 2 attributes are normally distributed: density and pH.
    • density: there are samples with density greater than 1. The next step is to investigate how density relates to wine quality level as well as to residual sugar.
  • The other 9 attributes are more or less positively skewed:
    • residual.sugar and chlorides are highly skewed to the right; log scaling is applied to both variables to better view the central tendency.
    • volatile.acidity, citric.acid and alcohol show more than one peak in their histograms or kernel density plots, which makes them promising predictors for wine quality level.

2.2 Bivariate Plot & Analysis

In the section above, we found that some wine samples are denser than water, and one possible cause could be high sucrose content. To find out, let’s investigate how density is related to residual sugar. Since residual sugar has some significant outliers, we will apply log scaling to it.

Wine samples with density greater than 1 have been defined as Dense Wine, and those with density less than or equal to 1 as Regular Wine.

For the Regular Wine group, the residual sugar level is concentrated in the range from 10^0.2 (≈1.58) to 10^0.4 (≈2.51), while for the Dense Wine group the majority of the data fall between 2 and 4. Although the overlap between the two ranges suggests other parameters also contribute to wine density, we can still infer that density is positively correlated with residual sugar in both groups. As a future topic, a t-test could check whether the difference is statistically significant; that is not included in this study.
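The grouping used in this plot can be sketched in one line; the 1 g/cm³ cutoff comes straight from the definition above (the density values here are made up):

```r
# Label each sample Dense or Regular based on whether density exceeds 1.
density_vals <- c(0.995, 0.998, 1.001, 1.003)
group <- ifelse(density_vals > 1, "Dense Wine", "Regular Wine")
group
```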

In order to build the predictive model, we need to understand the association patterns between these physicochemical attributes and our dependent variable level.

Since residual sugar and chlorides have significant outliers, we will first apply a log transformation to these two variables, then develop a function to create boxplots for all 11 attributes.
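A sketch of that step, here computing per-level medians on a tiny illustrative frame instead of drawing the boxplots (the column values are made up):

```r
# Log-transform a skewed variable, then summarize it by quality level.
df <- data.frame(
  level     = factor(c("Low", "Low", "Medium", "Medium", "High", "High"),
                     levels = c("Low", "Medium", "High")),
  chlorides = c(0.10, 0.12, 0.08, 0.09, 0.06, 0.07)
)
df$log_chlorides <- log10(df$chlorides)
tapply(df$log_chlorides, df$level, median)   # per-level medians, as a boxplot would show
```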

Boxplot for acidity group:

Observations:

  • positively correlated with level: fixed.acidity and citric.acid
  • negatively correlated with level: volatile.acidity

Boxplot for sulfur dioxide group:

It seems there is no clear trend between the sulfur dioxide group and wine quality level.

Observations:

  • positively correlated with level: sulphates and alcohol
  • negatively correlated with level: pH
  • unclear trend: density

Next, let’s look at the log-transformed data group: residual sugar and chlorides.

Observations:

  • positively correlated with level: log-transformed residual.sugar
  • negatively correlated with level: log-transformed chlorides

Now let’s regroup the independent variables into three categories based on their relationship with the dependent variable level:

  • Increasing trend group: indicating positive correlation between physicochemical attribute and wine quality level.
    • fixed.acidity
    • citric.acid
    • sulphates
    • alcohol
    • log_residual.sugar
  • Decreasing trend group: indicating negative correlation between physicochemical attribute and wine quality level.
    • volatile.acidity
    • pH
    • log_chlorides
  • Unclear trend group: no clear correlation pattern observed between physicochemical attribute and wine quality level.
    • free.sulfur.dioxide
    • total.sulfur.dioxide
    • density

The 5 physicochemical attributes showing a positive correlation with wine quality level can be considered candidate predictors, and the 3 attributes showing a negative correlation can also be considered candidates. The 3 attributes with no clear correlation to wine quality level will not be included in further analysis.

So far, we have narrowed the predictor variables down from 11 attributes to 8.

2.3 Multi-variate Plot & Analysis

According to the source paper, there may be correlations among the physicochemical attributes, while ideally predictor variables should be independent of each other to minimize the noise brought by collinearity. So the next step is to check the correlations among these 8 candidate predictors to help select the independent ones.

Create a correlation chart for the 8 candidate predictors:
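A base-R sketch of the correlation check (the report itself uses a helper from the sthda reference to get r, p-values and significance symbols in one call):

```r
# Correlation matrix on synthetic columns: b is built to track a, c is independent.
set.seed(42)
d <- data.frame(a = rnorm(100))
d$b <- 0.8 * d$a + rnorm(100, sd = 0.3)
d$c <- rnorm(100)
round(cor(d), 2)
```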

According to Evans (1996), correlation value r can be interpreted as:

  • .00-.19 “very weak”
  • .20-.39 “weak”
  • .40-.59 “moderate”
  • .60-.79 “strong”
  • .80-1.0 “very strong”

Among the above 8 attributes, fixed.acidity shows a strong positive correlation with citric.acid and a strong negative correlation with pH. Thus, this attribute will not be kept as a candidate predictor, to minimize collinearity noise.

Now let’s create a heatmap to view the associations among the 7 candidate predictors:

## $r
##                    log_chlorides citric.acid sulphates volatile.acidity
## log_chlorides                  1                                       
## citric.acid                 0.18           1                           
## sulphates                   0.28        0.31         1                 
## volatile.acidity            0.11       -0.55     -0.26                1
## pH                         -0.28       -0.54      -0.2             0.23
## alcohol                     -0.3        0.11     0.094             -0.2
## log_residual.sugar          0.12        0.17     0.011            0.024
##                        pH alcohol log_residual.sugar
## log_chlorides                                       
## citric.acid                                         
## sulphates                                           
## volatile.acidity                                    
## pH                      1                           
## alcohol              0.21       1                   
## log_residual.sugar -0.091   0.081                  1
## 
## $p
##                    log_chlorides citric.acid sulphates volatile.acidity
## log_chlorides                  0                                       
## citric.acid              2.4e-13           0                           
## sulphates                      0           0         0                 
## volatile.acidity         1.3e-05    1.8e-128   2.6e-26                0
## pH                       5.8e-31      1e-122   2.1e-15                0
## alcohol                  1.6e-35     1.1e-05   0.00018          3.2e-16
## log_residual.sugar       2.7e-06       4e-12      0.67             0.33
##                         pH alcohol log_residual.sugar
## log_chlorides                                        
## citric.acid                                          
## sulphates                                            
## volatile.acidity                                     
## pH                       0                           
## alcohol                  0       0                   
## log_residual.sugar 0.00026  0.0013                  0
## 
## $sym
##                    log_chlorides citric.acid sulphates volatile.acidity pH
## log_chlorides      1                                                      
## citric.acid                      1                                        
## sulphates                        .           1                            
## volatile.acidity                 .                     1                  
## pH                               .                                      1 
## alcohol                                                                   
## log_residual.sugar                                                        
##                    alcohol log_residual.sugar
## log_chlorides                                
## citric.acid                                  
## sulphates                                    
## volatile.acidity                             
## pH                                           
## alcohol            1                         
## log_residual.sugar         1                 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

From the heat map we can see that citric.acid is also negatively correlated with both volatile.acidity and pH, but since the r values (-0.55 and -0.54) indicate only moderate correlation, we can still keep it for prediction.

The next step is to build a prediction model by fitting the above 7 predictors.

3 Prediction Model

Since our dependent variable is categorical, we need a multi-class classification model.

The dataset for prediction model should include 7 predictors and 1 target variable.

3.1 Data Partition: Training & Test

We first need to split the dataset into training and test subsets: training data for model fitting, and test data for performance evaluation. Random sampling will be applied within each level of the dependent variable across the 1599 samples, resulting in a 70% training / 30% test split.
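A base-R sketch of a stratified 70/30 split: sample row indices within each level so class proportions are preserved (the class counts below match the confusion matrices later in the report; a package helper such as caret's createDataPartition does the same job):

```r
# Stratified sampling: draw 70% of the indices within each quality level.
set.seed(123)
level <- factor(rep(c("Low", "Medium", "High"), times = c(63, 1319, 217)))
train_idx <- unlist(lapply(split(seq_along(level), level),
                           function(idx) sample(idx, floor(0.7 * length(idx)))))
length(train_idx) / length(level)   # close to 0.70
```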

3.2 Random Forest Prediction Model

For this study, the Random Forest algorithm is selected for its strong accuracy and efficiency.

Build the prediction model using randomForest on the training set.

## 
## Call:
##  randomForest(formula = level ~ citric.acid + log_chlorides +      sulphates + volatile.acidity + pH + log_residual.sugar +      alcohol, data = candi_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 14.9%
## Confusion matrix:
##        Low Medium High class.error
## Low      3     41    1  0.93333333
## Medium   2    888   34  0.03896104
## High     0     89   63  0.58552632

The overall OOB error rate estimated on the training dataset is approximately 14.9%. Note that the class errors for Low and High are much higher than for Medium, reflecting the class imbalance in the training data.
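As a sanity check, the OOB rate can be recomputed by hand from the training confusion matrix above (off-diagonal counts are misclassifications):

```r
# Training confusion matrix as printed by randomForest (rows = observed).
cm_train <- matrix(c(3,  41,  1,
                     2, 888, 34,
                     0,  89, 63),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("Low", "Medium", "High"),
                                   c("Low", "Medium", "High")))
1 - sum(diag(cm_train)) / sum(cm_train)   # ~0.149
```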

3.3 Model Performance

Now let’s check the prediction performance on test dataset.

##         Predicted
## Observed Low Medium High
##   Low      0     17    1
##   Medium   2    380   13
##   High     0     32   33
## Accuracy of Prediction Model is: 0.8640167
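Accuracy here is simply the share of diagonal (correctly classified) counts in the confusion matrix above; recomputing it from the printed counts reproduces the reported figure:

```r
# Test-set confusion matrix (rows = observed, columns = predicted).
cm_test <- matrix(c(0,  17,  1,
                    2, 380, 13,
                    0,  32, 33),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("Low", "Medium", "High"),
                                  c("Low", "Medium", "High")))
sum(diag(cm_test)) / sum(cm_test)   # 0.8640167
```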

The prediction model achieved approximately 86% accuracy on the test set.

In the univariate plot section, we proposed volatile.acidity, alcohol and citric.acid as potential predictors based on the two or three peaks observed in their histograms.

In the bivariate plot section, boxplots also revealed that citric.acid, sulphates, log_residual.sugar and alcohol all show positive correlation with wine quality level.

Now let’s visualize the ranking of all 7 attributes by variable importance.

Our earlier candidates alcohol, volatile.acidity and citric.acid rank No. 1, No. 2 and No. 5 respectively. The most important variable, alcohol, is positively correlated with wine quality level, and the second most important, volatile.acidity, is negatively correlated with it.

4 Summary

By applying the random forest algorithm, we achieved approximately 86% accuracy in predicting red wine quality level from 7 physicochemical attributes.

The narrowing of the 11 variables down to 7 predictors was driven mainly by the EDA phase. The section below includes three final plots developed during that phase.

4.1 Final Plots

Final Plot 1: this plot is kept because we can see three peaks in the histogram and two modes in the kernel density curve. Together they suggested volatile.acidity might be a good predictor for wine quality level, which was later confirmed: it is the 2nd most important variable in the Random Forest model. Another reason is that the outlier statistics are overlaid in a compact and informative way.

Final Plot 2: this plot is selected because it reveals the positive correlation between residual sugar and density. It also compares the Regular Wine group versus the Dense Wine group, and the trend holds for both, though an overlapping range is observed. As a future topic, a t-test could check whether the difference is statistically significant.

Final Plot 3: this plot is selected because the five attributes shown have a clear positive correlation with wine quality level. Alcohol and sulphates rank No. 1 and No. 3 in importance for the prediction model.

4.2 Reflection

In this study, we applied random forest algorithm to predict wine quality level. The methodology involves two major phases: EDA phase and prediction model development phase.

  • EDA phase: this is a highly iterative process and the part I spent the most time on. One lesson learned is that in EDA it is very tempting to keep going, but it is also important to know when to stop, because real projects have time and monetary budgets. Fortunately, this EDA phase reached a relatively satisfactory state.

  • Prediction model development: there are a few limitations in this phase.
    • fixed.acidity is eliminated as a predictor because of its collinearity. To complete the model performance comparison and feature selection, we should also train a model that includes fixed.acidity but removes the two attributes correlated with it (citric.acid and pH). This would give a more comprehensive view of the candidate predictors.
    • Ideally, we should divide the dataset into training, validation and test subsets. That way we could train several prediction models on the training data, compare their error rates on the validation set, and apply the best-performing model to the test set. In this study, the validation step is not included.

5 References

  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

  2. Correlation matrix: An R function to do all you need (http://www.sthda.com/english/wiki/correlation-matrix-an-r-function-to-do-all-you-need#at_pco=smlre-1.0&at_si=589272c605773292&at_ab=per-2&at_pos=3&at_tot=4)

  3. Source code for Variable Importance Plot (https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic)

  4. Evans, J. D. (1996). Straightforward statistics for the behavioral sciences. Pacific Grove, CA: Brooks/Cole Publishing.